Abstract

The “Collective Intelligence” (COIN) framework concerns
the design of collectives of reinforcement-learning agents
such that their interaction causes a provided “world” utility
function concerning the entire collective to be maximized.
Previously, we applied that framework to scenarios involving
Markovian dynamics where no re-evolution of the system
from counter-factual initial conditions (an often expensive
calculation) is permitted. This approach sets the individual
utility function of each agent to be both aligned with
the world utility and, at the same time, easy for the associated
agents to optimize. Here we extend that approach to systems
involving non-Markovian dynamics. In computer simulations,
we compare our techniques with each other and with
conventional “team games”. We show that whereas in team games
performance often degrades badly with time, it steadily
improves when our techniques are used. We also investigate
situations where the system’s dimensionality is effectively
reduced. We show that this leads to difficulties in the agents’
ability to learn. The implication is that “learning” is a
property only of high-enough dimensional systems.

Introduction 

In this paper we are concerned with large distributed col- 
lectives of interacting goal-driven computational processes, 
where there is a provided ‘world utility’ function that rates 
the possible behaviors of that collective (Wolpert, Turner, 
& Frank 1999; Wolpert & Turner 1999). We are particu- 
larly concerned with such collectives where the individual 
computational processes use machine learning techniques 
(e.g., Reinforcement Learning (RL) (Kaelbling, Littman, &
Moore 1996; Sutton & Barto 1998; Sutton 1988; Watkins & 
Dayan 1992)) to try to achieve their individual goals. We 
represent those goals of the individual processes as maximizing
an associated ‘payoff’ utility function, one that in
general can differ from the world utility. 

In such a system, we are confronted with the following
inverse problem: How should one initialize/update the payoff
utility functions of the individual processes so that the ensu- 
ing behavior of the entire collective achieves large values of 
the provided world utility? In particular, since in truly large 
systems detailed modeling of the system is usually impossible,
how can we avoid such modeling? Can we instead leverage
the simple assumption that our learning algorithms are
individually fairly good at what they do to achieve a large 
world utility value? 

We are concerned with payoff utility functions that are 
“aligned” with the world utility, in that modifications a 
player might make that would improve its payoff utility also 
must improve world utility.¹ Fortunately the equivalence
class of such payoff utilities extends well beyond team-game 
utilities. In particular, in previous work we used the Collec- 
tive INtelligence (COIN) framework to derive the ‘Wonder- 
ful Life Utility’ (WLU) payoff function (Wolpert & Turner 
1999) as an alternative to a team-game payoff utility. The 
WLU is “aligned with” world utility, as desired. In addition
though, WLU overcomes much of the signal-to-noise
problem of team game utilities (Turner & Wolpert 2000;
Wolpert, Turner, & Frank 1999; Wolpert & Turner 1999; 
Wolpert, Wheeler, & Turner 2000). 

In a recent paper, we extended the COIN framework
with an approach based on Transforming Arguments Util- 
ity functions (TAU) before the evaluation of those functions 
(Wolpert & Lawson 2002). The TAU process was originally 
designed to be applied to the individual utility functions of 
the agents in systems in which the world utility depends on 
the final state in an episode of variables outside the collec- 
tive that undergo Markovian dynamics, with the update rule 
of those variables reflecting the state of the agents at the be- 
ginning of the episode. This is a very common scenario, ob- 
taining whenever the agents in the collective act as control 
signals perturbing the evolution of a Markovian system. 

In the pre-TAU version of the COIN framework, to 
achieve good signal-to-noise for such scenarios requires 
knowing the evolution operator. However, it also might
require re-evolving the system from counter-factual initial 
states of the agents to evaluate each agent’s reward for a 
particular episode. This can be computationally expensive. 
With TAU utility functions no such re-evolving is needed; 
the observed history of the system in the episode is transformed
in a relatively cheap calculation, and then the utility
function is evaluated with that transformed history rather
than the actual one.

1 Such alignment can be viewed as an extension of the concept
of incentive compatibility in mechanism design (Fudenberg &
Tirole 1991) to non-human agents, off-equilibrium behavior, etc.

The TAU process has other advantages that apply even in 
scenarios not involving Markovian dynamics. In particular it 
allows us to employ the COIN framework even when not all 
arguments of the original utility function are observable, due 
for example to communication limitations. In addition, cer- 
tain types of TAU transformations result in utility functions 
that are not exactly aligned with the world utility, but have 
so much better signal-to-noise that the collective performs 
better when agents use those transformed utility functions 
than it does with exactly aligned utility functions. 

Here we investigate the extension of the TAU process 
to systems with non-Markovian dynamics where the world 
utility is the same function of the state of the system at every 
moment in time. To do this we have the agents operate on 
very fast time-scales compared to that dynamics, i.e., have 
the time-steps at which they make their successive moves be 
very closely packed. We also have the moves of the agents
consist of very small perturbations to the underlying vari- 
ables of the system rather than the direct setting of those 
variables. Now since the world utility is defined for every 
moment in time, there is a surface taking the values of those 
underlying variables at any time-step to the associated value 
of the world utility. So the problem for the agents is one of 
traversing that surface to try to reach values of the underlying
variables that have a good associated world utility.

Since the time-scales are so small though, we can approximate
the effects of the agents’ moves at any time-step on the
value of the world utility at the next time-step as though the
intervening evolution were linear (Markovian). Now, as in
the original TAU work, assume for simplicity that the linear
dynamics is known for each such time-step. Then at each
time-step the problem is reduced to the exact same one that
was addressed in that original TAU work.

Unlike in that original work though, here the linear rela- 
tion between the moves of the agents and the resultant value 
of the world utility at the next time-step changes from one 
time-step to the next, since both the underlying variables of the
system and the associated gradient change. Accordingly,
the mapping the agents are trying to learn from their moves 
to the resultant rewards changes in time. 

Here we do not confront this nonstationarity. We use a set 
of computer experiments to compare use of the TAU process 
to set the utility functions of agents to the alternative
conventional approach of “team games” in this non-Markovian
domain. We verify that the TAU process outperforms this al- 
ternative. In particular, in many experiments the team game 
resulted in world utility values that decrease with time, i.e., 
the agents steer the underlying variables to worse and worse 
values. In contrast, the TAU process steers the underlying
variables in such a way that world utility improves with
time.

We also investigate what happens as the underlying sys- 
tem is modified so that the moves of the individual agents 
become less and less consequential to the dynamics.
Intuitively, one would expect in such a case that the system’s
effective dimensionality gets reduced, while the agents also 
have a harder time learning. We present tentative evidence
corroborating this prediction. The implication is that
“learning” is a property only of high-enough dimensional systems.

The Mathematics of Collective Intelligence 

We view the individual agents in the collective as players
involved in a repeated game.² Let $Z$ with elements $\zeta$ be the
space of possible joint moves of all players in the collective
in some stage. We wish to search for the $\zeta$ that maximizes
a provided world utility $G(\zeta)$. In addition to $G$ we
are concerned with utility functions $\{g_\eta\}$, one such function
for each variable/player $\eta$. We use the notation $\hat{\eta}$ to refer to
all players other than $\eta$.

Intelligence and the central equation 

We wish to “standardize” utility functions so that the numeric
value they assign to a $\zeta$ only reflects their ranking of
$\zeta$ relative to certain other elements of $Z$. We call such a
standardization of an arbitrary utility $U$ for player $\eta$ the
“intelligence for $\eta$ at $\zeta$ with respect to $U$”. Here we will use
intelligences that are equivalent to percentiles:

$$\epsilon_U(\zeta : \eta) \equiv \int d\mu_{\zeta_{\hat{\eta}}}(\zeta')\, \Theta[U(\zeta) - U(\zeta')], \qquad (1)$$

where the Heaviside function $\Theta$ is defined to equal 1 when
its argument is greater than or equal to 0, and to equal 0
otherwise, and where the subscript on the (normalized) measure
$d\mu$ indicates it is restricted to $\zeta'$ sharing the same non-$\eta$
components as $\zeta$. In general the measure must reflect the type
of system at hand, e.g., whether $Z$ is countable or not, and if
not, what coordinate system is being used. Other than that,
any convenient choice of measure may be used and the theorems
will still hold. Intelligence values are always between
0 and 1.
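
As an illustration of the percentile-style intelligence defined above, the following Python sketch estimates it for one player by enumerating a finite set of alternative moves while holding the other players' moves fixed. The toy utility, the candidate move set, and the uniform measure over candidates are all assumptions made for the example; they are not taken from the experiments in this paper.

```python
def intelligence(utility, joint_move, player, candidate_moves):
    """Fraction of alternative moves for `player` (others held fixed) whose
    utility does not exceed that of the actual joint move.

    A uniform measure over `candidate_moves` is assumed in this sketch.
    """
    actual = utility(joint_move)
    count = 0
    for move in candidate_moves:
        alternative = list(joint_move)
        alternative[player] = move
        # Heaviside step with Theta(0) = 1, as in the definition above.
        if actual - utility(alternative) >= 0:
            count += 1
    return count / len(candidate_moves)


def toy_utility(z):
    # Hypothetical utility, maximal when every move is 0.
    return -sum(x * x for x in z)


moves = [-1.0, -0.5, 0.0, 0.5, 1.0]
z = [0.0, 1.0, -0.5]
eps_best = intelligence(toy_utility, z, 0, moves)   # player 0 plays the best move
eps_worst = intelligence(toy_utility, z, 1, moves)  # player 1 plays a worst move
print(eps_best, eps_worst)
```

A player whose move is at least as good as every alternative gets intelligence 1, while a player making one of the worst available moves gets a value near the bottom of the range.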

Our uncertainty concerning the behavior of the system is
reflected in a probability distribution over $Z$. Our ability
to control the system consists of setting the value of some
characteristic of the collective, e.g., setting the functions of
the players. Indicating that value by $s$, our analysis revolves
around the following central equation for $P(G \mid s)$, which
follows from Bayes’ theorem:

$$P(G \mid s) = \int d\vec{\epsilon}_G\, P(G \mid \vec{\epsilon}_G, s) \int d\vec{\epsilon}_g\, P(\vec{\epsilon}_G \mid \vec{\epsilon}_g, s)\, P(\vec{\epsilon}_g \mid s), \qquad (2)$$

where $\vec{\epsilon}_g \equiv (\epsilon_{g_{\eta_1}}(\zeta : \eta_1), \epsilon_{g_{\eta_2}}(\zeta : \eta_2), \ldots)$ is the vector of
the intelligences of the players with respect to their associated
functions, and $\vec{\epsilon}_G \equiv (\epsilon_G(\zeta : \eta_1), \epsilon_G(\zeta : \eta_2), \ldots)$ is the
vector of the intelligences of the players with respect to $G$.

Note that $\epsilon_{g_\eta}(\zeta : \eta) = 1$ means that player $\eta$ is fully
rational at $\zeta$, in that its move maximizes its utility, given
the moves of the other players. In other words, a point $\zeta$ where
$\epsilon_{g_\eta}(\zeta : \eta) = 1$ for all players $\eta$ is one that meets the
definition of a game-theory Nash equilibrium (Fudenberg &
Tirole 1991). Note that consideration of points $\zeta$ at which
not all intelligences equal 1 provides the basis for a
model-independent formalization of bounded rationality game
theory, a formalization that contains variants of many of the
theorems of conventional full-rationality game theory (Wolpert
2001a). On the other hand, a $\zeta$ at which all components of
$\vec{\epsilon}_G$ equal 1 is a local maximum of $G$ (or more precisely, a
critical point of the $G(\zeta)$ surface).

2 The full mathematics of the COIN framework, however, extends
significantly beyond what is needed to address such games.
See (Wolpert & Turner 2001).

If we can choose $s$ so that the third conditional probability
in the integrand is peaked around vectors $\vec{\epsilon}_g$ all of whose
components are close to 1, then we have likely induced large
intelligences. If in addition the second term is peaked about
$\vec{\epsilon}_G$ equal to $\vec{\epsilon}_g$, then $\vec{\epsilon}_G$ will also be large. Finally, if the
first term is peaked about high $G$ when $\vec{\epsilon}_G$ is large, then our
choice of $s$ will likely result in high $G$, as desired.

Intuitively, the requirement that the utility functions have 
high “signal-to-noise” (an issue not considered in conven- 
tional work in mechanism design) arises in the third term. 
It is in the second term that the requirement that the utility
functions be “aligned with $G$” arises. In this work we
concentrate on these two terms, and show how to simultane- 
ously set them to have the desired form. 

Details of the stochastic environment in which the collec- 
tive operates, together with details of the learning algorithms 
of the players, are reflected in the distribution $P(\zeta \mid s)$ which
underlies the distributions appearing in Equation 2. Note
though that independent of these considerations, our desired
form for the second term in Equation 2 is assured if we have
chosen utilities such that $\vec{\epsilon}_g$ equals $\vec{\epsilon}_G$ exactly for all
$\zeta$. We call such a system factored. In game-theory language,
the Nash equilibria of a factored collective are local
maxima of G. In addition to this desirable equilibrium be- 
havior, factored collectives automatically provide appropri- 
ate off-equilibrium incentives to the players (an issue rarely 
considered in game theory / mechanism design). 

Opacity 

We now focus on algorithms based on utility functions $\{g_\eta\}$
that optimize the signal/noise ratio reflected in the third
term, subject to the requirement that the system be factored.
To understand how these algorithms work, given a measure
$d\mu(\zeta_\eta)$, define the opacity at $\zeta$ of utility $U$ as:

$$\Omega_U(\zeta : \eta, s) \equiv \int d\zeta'\, J(\zeta' \mid \zeta)\, \frac{|U(\zeta) - U(\zeta'_{\hat{\eta}}, \zeta_\eta)|}{|U(\zeta) - U(\zeta_{\hat{\eta}}, \zeta'_\eta)|}, \qquad (3)$$

where $J$ is defined in terms of the underlying probability
distributions,³ and $(\zeta'_{\hat{\eta}}, \zeta_\eta)$ is defined as the worldline whose $\hat{\eta}$
components are the same as those of $\zeta'$ while its $\eta$
components are the same as those of $\zeta$ (Wolpert & Turner 2001).

3 Writing it out in full, J{ C' | C) = I S )/A(C» I 

C-„,s), with: 

P(Q I 1 Ct?, g)M(Cr 7 ) + 

v?> C I — 2 

p(C 1 ci, 5 )p(c.ic;^)m(c.) 

The denominator absolute value in the integrand in Equation 3
reflects how sensitive $U(\zeta)$ is to changing $\zeta_\eta$. In
contrast, the numerator absolute value reflects how sensitive
$U(\zeta)$ is to changing $\zeta_{\hat{\eta}}$. So the smaller the opacity of a
utility function $g_\eta$, the more $g_\eta(\zeta)$ depends only on the move of
player $\eta$, i.e., the better the associated signal-to-noise ratio
for $\eta$. Intuitively then, lower opacity should mean it is easier
for $\eta$ to achieve a large value of its intelligence.

To formally establish this, we use the same measure $d\mu$
to define opacity as the one that defined intelligence. Under
this choice expected opacity bounds how close to 1 expected
intelligence can be (Wolpert & Turner 2001):

$$E(\epsilon_U(\zeta : \eta) \mid s) \le 1 - K, \quad \text{where} \quad K \le E(\Omega_U(\zeta : \eta, s) \mid s). \qquad (5)$$

So low expected opacity of utility $g_\eta$ ensures that a necessary
condition is met for the third term in Equation 2 to have the
desired form for player $\eta$. While low opacity is not, formally
speaking, also sufficient for $E(\epsilon_U(\zeta : \eta) \mid s)$ to be close to
1, in practice the bounds in Equation 5 are usually tight.

Difference Utilities 

It is possible to solve for the set of all utilities that are fac- 
tored with respect to a particular world utility. Unfortu- 
nately, in general it is not possible for a collective both to 
be factored and to have zero opacity for all of its players. 
However consider difference utilities, which are of the form 

$$U(\zeta) = G(\zeta) - \Gamma(f(\zeta)) \qquad (6)$$

where $\Gamma(f)$ is independent of $\zeta_\eta$. Any difference utility
is factored (Wolpert 2001b), and under benign approximations,
$E(\Omega_U \mid s)$ is minimized over the set of such utilities
by choosing

$$\Gamma(f(\zeta)) = E(G \mid \zeta_{\hat{\eta}}, s), \qquad (7)$$

up to an overall additive constant. We call the resultant
difference utility the Aristocrat utility (AU), loosely reflecting
the fact that it measures the difference between a player’s
actual action and the average action.

The COIN Framework for Systems with 
Markovian Evolution 

We consider games which consist of multi-step “episodes”.
Within each episode the entire system evolves in a Markovian
manner from the initial moves of the players. We are
interested in such games where some of the players $\eta$ are not
agents whose initial state is under control of a learning
algorithm that we control, but rather constitute an environment
for those controllable agents (i.e., where some of the players
correspond to the state of nature).

Let $A$ be the Markovian single step evolution operator of
the entire system through an episode,

$$\zeta_t = A\,\zeta_{t-1}. \qquad (8)$$

Each component $\zeta_\eta$, for example, could be a one
dimensional real number. The row vector $A_\eta$ would then
be $\eta$’s update rule. Alternatively, each agent could be
represented by one of $N$ symbolic values. In that case, $\zeta_\eta$ would
be given in a unary representation as a vector in $\mathbb{R}^N$ (i.e.
a Haar basis). Considering such large spaces is necessary
to describe arbitrary, nonlinear dynamics as Markovian
evolution. Here we will concentrate on the former case, where
the moves of the players are all real numbers.

The full multiple time step evolution of an episode is
given by the single step operator in the usual way: Let

$$C \equiv \begin{bmatrix} I \\ A \\ A^2 \\ \vdots \\ A^T \end{bmatrix}$$

where $T$ is the number of timesteps per episode. This operator
applied to our initial state $\zeta_0$ yields the entire “worldline”
$\zeta$, or time history, of the system:

$$\zeta = C\,\zeta_0. \qquad (9)$$
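
As a small illustration of how the single step operator generates a worldline, the following Python sketch iterates a linear update and collects the successive states. The 2×2 matrix and initial state are arbitrary, and including the initial state as the first entry of the worldline is an assumption of this sketch.

```python
def mat_vec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def worldline(a, z0, steps):
    """Return [z_0, z_1, ..., z_T] with z_t = A z_{t-1}."""
    history = [list(z0)]
    for _ in range(steps):
        history.append(mat_vec(a, history[-1]))
    return history


A = [[0.5, 0.0],
     [1.0, 1.0]]   # hypothetical single step operator
z0 = [2.0, 0.0]    # hypothetical initial state
line = worldline(A, z0, 3)
print(line)
```

Each row of the matrix acts as the update rule for one component, and the whole history follows from the initial state alone.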

We consider difference utility functions of the form

$$g_\eta(\zeta) = G(C\zeta_0) - \Gamma_\eta(F_\eta C \zeta_0) \qquad (10)$$

where $G$ is the world utility function to be optimized. We
will choose $F_\eta$ so that the product $F_\eta C \zeta_0$ is independent of
agent $\eta$’s actions. This is a necessary and sufficient condition
for the associated difference utility $g_\eta(\zeta)$ to be factored
with respect to the world utility $G$ for any and all choices
of $\Gamma_\eta$. In general, $\Gamma_\eta$ can be chosen in such a way to
optimize learnability. Here though, for simplicity, we choose
$\Gamma_\eta = G$. Accordingly, application of the $F_\eta$ operator is an
instance of transforming the argument of the (second term of
the) utility functions of the agents, i.e., it is a TAU process.


Observability restrictions 

In practice, the full worldline of the system may not be fully 
observable to each agent. Such limited observability of a 
particular component may be determined by the problem. In 
other cases, due to communication constraints each agent is 
only allowed to observe a certain number of components, 
and must select which such components to observe, for
example to optimize some auxiliary quantity like opacity.
Similarly, the dynamics may not be known exactly to the agent;
some rows of $C$ may be uncertain to an agent, or simply
cannot be determined. In these kinds of situations the $g_\eta$
described above cannot be evaluated at the end of an episode
by agent $\eta$, even if the value $G(\zeta_t)$ is globally broadcast to all
agents.

The TAU approach outlined above is well-suited to address
such situations. Formally, a decimated identity operator
$L$ can be defined whose diagonal elements are 0 or 1
depending on whether or not the corresponding components
are observable. The corresponding factored utility for agent $\eta$ is

$$g_\eta(\zeta_t) = G(\zeta_t) - G(L F_\eta \zeta_t), \qquad (11)$$

where in general $L$ may vary with $\eta$. Given global broadcast
to all agents of the value of $G(\zeta_t)$, for each agent to evaluate
this type of $g_\eta$ only requires that those components of $F_\eta \zeta_t$
that are non-zero (and therefore can vary) after application
of the $L$ operator be observed.

This difference utility has two main sources of noise, one 
from potentially poor choice of the clamping operator, and 
the other from the use of L in the second (subtracted) term 
but not in the first. To address that latter source of noise we 
can impose limited observability on the first term in addition 
to the second one, getting 

$$g_\eta(\zeta_t) = G(L\zeta_t) - G(L F_\eta \zeta_t). \qquad (12)$$

The new utility is not factored with respect to G. Ac- 
cording to the central equation however, it may still result in 
better performance than when we don’t have L in the first 
term, if the improvement in opacity more than offsets the 
loss of exact factoredness. In addition to the potential for 
such far superior opacity, this utility has the added advantage
that now we don’t even need to rely on global broadcast
of $G(L\zeta_t)$ to evaluate $g_\eta$.
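
The two limited-observability utilities can be compared side by side in a short Python sketch. Here the clamping operator is taken to zero out the agent's own component and the decimated identity is represented as a 0/1 mask; both choices, together with the toy world utility and the numbers used, are assumptions for illustration only.

```python
def world_utility(z):
    # Hypothetical world utility with pairwise couplings.
    n = len(z)
    return sum(z[i] * z[j] for i in range(n) for j in range(i + 1, n))


def clamp(z, player):
    """Stand-in for the clamping operator: zero out the player's component."""
    out = list(z)
    out[player] = 0.0
    return out


def observe(z, mask):
    """Stand-in for the decimated identity: keep only observable components."""
    return [x if m else 0.0 for x, m in zip(z, mask)]


def g_factored(z, player, mask):
    # Limited observability in the subtracted term only.
    return world_utility(z) - world_utility(observe(clamp(z, player), mask))


def g_nonfactored(z, player, mask):
    # Limited observability in both terms; needs no broadcast of G.
    return (world_utility(observe(z, mask))
            - world_utility(observe(clamp(z, player), mask)))


z = [1.0, 2.0, 3.0, 4.0]
mask_75 = [1, 1, 1, 0]   # first 75% of components observable
mask_all = [1, 1, 1, 1]
print(g_factored(z, 0, mask_75), g_nonfactored(z, 0, mask_75))
```

With full observability the two forms coincide; with a partial mask the first form mixes contributions from unobserved components into only one of its terms, which is the noise source that the second form removes.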

The non-Markovian case 

To address the general nonlinear problem, we assign each
agent a real-valued number $r_\eta$. The state of the system $\zeta_t$
is the Cartesian product of each agent’s action and $r_\eta$. Each
agent can choose among three actions, which add one of the
values $\{\pm\Delta, 0\}$ to $r_\eta$. Nonlinear evolution then occurs to
$r$, to produce the value at the end of this episode, $\zeta_{t+1} =
c_t(\zeta_t)$. That value then serves as the argument of $G$.
Construction of factored utilities

$$g_\eta(\zeta_{t+1}) = G(c_t(\zeta_t)) - G(c_t(CL\,\zeta_t)) \qquad (13)$$

requires that $c_t(CL\,\zeta_t)$ be independent of $\eta$’s choice of action.
One way to accomplish this is to clamp (apply $CL$ to) $\zeta_t$ and
re-evolve the system. To avoid re-evolving the system, we
approximate $c_t(CL\,\zeta_t)$ with a Taylor series expansion about
the unclamped starting state $\zeta_t$:

$$c_t(CL\,\zeta_t) \approx c_t(\zeta_t) + \Delta(\zeta_t - CL\,\zeta_t) \cdot \nabla_\zeta c_t(\zeta_t). \qquad (14)$$

Assuming not all components of $\zeta_t$ equal 0, we can recast
this as the multiplication of a matrix times $\zeta_t$, where that
matrix is indexed by time. In doing this we reduce the system
to the linear case, only with a time-dependent update
matrix.

Note that varying $\Delta$ provides us a small parameter to control
the expansion. It should also be noted that while this
method requires that $c_t(\zeta)$ be differentiable, the world utility
$G$ need not be.
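
The trade-off between re-evolving the clamped state and using a first-order expansion can be checked numerically. The Python sketch below uses a quadratic update rule with one coefficient array per component and compares the exactly re-evolved clamped state with its first-order approximation. The coefficient layout, the random values, and the sign convention (displacement taken as clamped minus unclamped state, as in the textbook form of the expansion) are assumptions of this sketch rather than details taken from the experiments.

```python
import random


def evolve(a, r):
    """Quadratic update: c_i(r) = sum_jk a[i][j][k] * r[j] * r[k]."""
    n = len(r)
    return [sum(a[i][j][k] * r[j] * r[k] for j in range(n) for k in range(n))
            for i in range(n)]


def jacobian_column(a, r, p):
    """Partial derivatives d c_i / d r_p for every component i."""
    n = len(r)
    return [sum((a[i][p][k] + a[i][k][p]) * r[k] for k in range(n))
            for i in range(n)]


def clamped_exact(a, r, p, clamp_value):
    """Re-evolve the system from the clamped state (the expensive route)."""
    clamped = list(r)
    clamped[p] = clamp_value
    return evolve(a, clamped)


def clamped_taylor(a, r, p, clamp_value):
    """First-order expansion about the unclamped state (no re-evolution)."""
    shift = clamp_value - r[p]  # only the clamped component moves
    base = evolve(a, r)
    slope = jacobian_column(a, r, p)
    return [b + s * shift for b, s in zip(base, slope)]


def max_error(a, r, p, delta):
    """Largest component-wise gap when undoing a move of size delta."""
    clamp_value = r[p] - delta
    exact = clamped_exact(a, r, p, clamp_value)
    approx = clamped_taylor(a, r, p, clamp_value)
    return max(abs(e - x) for e, x in zip(exact, approx))


rng = random.Random(1)
n = 5
a = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
     for _ in range(n)]
r = [rng.uniform(-1, 1) for _ in range(n)]
err_large = max_error(a, r, 0, 0.1)
err_small = max_error(a, r, 0, 0.01)
print(err_large, err_small)
```

For a quadratic update the approximation error is exactly second order in the step size, so shrinking the step by a factor of 10 shrinks the error by a factor of 100, which is the sense in which a small step controls the expansion.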

Experiments 

Numerical simulations were performed with 50 agents. After
an initial 100-episode training period, agents selected initial
actions in each subsequent episode with the same
reinforcement learning algorithm used in our previous work.




All players experienced a quadratic/nonlinear update rule
$c(\zeta_0) = \sum_{i,j} a_{ij}\, r^0_i r^0_j$ that depends on the agents’ “positions”
$\{r_i\}$. The coefficients are randomly generated. The world
utility function was a spin glass,

$$G(\zeta) = \sum_{i<j} J_{ij}\, \zeta_i \zeta_j. \qquad (15)$$

The agents are given a random initial starting point with
$-1 < r < 1$. Because $c$ is quadratic, $G(\zeta_t)$ is a quartic
polynomial in $N$ dimensions. Since the coefficients $\{a_{ij}\}$
have random signs, the function $G$ has as many increasing
directions as decreasing directions. The goal of the system
is to traverse this high dimensional surface, find an increasing
direction, and then follow that direction out to infinity.
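
The statement that a spin-glass utility of a quadratically evolved state is quartic in the starting positions can be checked with a short Python sketch: doubling every position should multiply the utility by 16. The per-component coefficient array, the random values, and the small system size are assumptions made for this example.

```python
import random


def evolve(a, r):
    """Quadratic update: c_i(r) = sum_jk a[i][j][k] * r[j] * r[k]."""
    n = len(r)
    return [sum(a[i][j][k] * r[j] * r[k] for j in range(n) for k in range(n))
            for i in range(n)]


def spin_glass(couplings, z):
    """Spin-glass utility: sum over i < j of J_ij * z_i * z_j."""
    n = len(z)
    return sum(couplings[i][j] * z[i] * z[j]
               for i in range(n) for j in range(i + 1, n))


rng = random.Random(0)
n = 4
a = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
     for _ in range(n)]
J = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
r = [rng.uniform(-1, 1) for _ in range(n)]  # random start with -1 < r < 1

g_base = spin_glass(J, evolve(a, r))
g_scaled = spin_glass(J, evolve(a, [2.0 * x for x in r]))
print(g_base, g_scaled)
```

The degree-four scaling is what makes the surface unbounded along any direction in which the utility is positive, so following such a direction outward keeps improving the world utility.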

We collected statistics by averaging runs over many randomly
set coefficients $a_{ij}$ and coupling constants $J_{ij}$. These
runs were for systems whose first 25% and 75% of components
at the end of the episode are observable, given some canonical
ordering of agents. We examined (Figure 1) world utility
value vs. episode number for six utility functions:

1) TAU $g$ for a fully observable system;

2) TAU $g$ for 75% observability, $g_{75\%}$;

3) The modification $g^{nf}_{75\%}$ giving a non-factored system,
again with 75% observability;

4) $g_{25\%}$ for a factored system with 25% observability;

5) $g^{nf}_{25\%}$ for a non-factored system with 25% observability;

6) The team game, where every $g_\eta = G$.

Even the results for limited observability clearly outper- 
form the corresponding team game in which there is full 
observability. Furthermore, for 75% observability, the non- 
factored utilities (L in both terms) consistently outperform 
their factored counterpart. In these runs factoredness fell to 
approximately 90%. The improvement in performance due 
to better signal-to-noise more than outweighs the degrada- 
tion due to loss in factoredness. 

Figure 1: System performance for N = 50 agents using
the Taylor Series method. The dynamics is governed by
a quadratic function of the agents’ “positions”. The world
utility G is a quartic in N dimensions. (Upper two graphs
are $g$ and $g^{nf}_{75\%}$; middle two are $g^{nf}_{25\%}$ and $g_{75\%}$; lower two
are $g_{25\%}$ and the team game $G$.) The initial training period is
not shown.



Figure 2: Taylor Series method where the quadratic coefficients
have more − than + signs. (Graphs: upper pair are $g$
and $g^{nf}_{75\%}$; middle three are $g_{75\%}$, $g_{25\%}$, and the team game;
lowest is $g^{nf}_{25\%}$.) In this case, three of the limited observability
utilities and the team game perform worse over time (i.e.
their world utilities decrease).



It is interesting to adjust the ratio of ± signs in the
coefficients of the polynomials. If we introduce, for example,
more negative coefficients than positive, we expect the
surface to preferentially turn down. The task for the agents
becomes more challenging. We find (Figure 2), in fact,
that three of the limited observability utilities perform worse
over time (i.e. their world utility decreases). The team game
also performs worse over time. In fact, not only does the
team game give poor performance, but it fails altogether.
The lowest noise TAU utilities $g$ and $g^{nf}_{75\%}$ still give robust
performance.

In this case, the team game gives worse performance than
a random walk, i.e. no learning is happening. In fact, the
system executes essentially deterministic, nonlinear behavior
(Figure 3). Remarkably, as we increase the data aging
parameter (weighting more heavily data that appeared further
in the past), the system becomes even more exotic, closely
resembling a low-dimensional nonlinear system. By aging
the data more severely, we effectively damp out a large portion
of the degrees of freedom stored in the agents’ training
sets, hence the lower dimensionality. Learning, it would
seem, is possible only in higher-dimensional systems.





Conclusion 

We present a detailed extension of the COIN framework to 
systems that undergo non-Markovian evolution. This builds 
on previous work where the Markovian case (Wolpert & 
Lawson 2002) was considered. The approach is applied 
to systems with nonlinear update rules using a perturbative 
technique. Results from numerical simulations find consis- 
tent, robust improvement of performance as compared to the 
conventional team game. 

This framework naturally includes the case of limited ob- 
servability. We found that even COIN-based utility func- 
tions constrained by limited observability often outper- 
formed team game utilities having full observability. We 
also found a new class of nonfactored utilities that consis- 
tently outperformed their factored counterpart, due to im- 
proved signal-to-noise characteristics. 

We find that the system’s performance can depend on the 
characteristics of the surface being optimized. We show that 
in some situations a team game will fail altogether (i.e. its 
performance will degrade over time) while the corresponding
TAU utility continues to perform well. In this
“non-learning regime”, the system executes interesting
deterministic, nonlinear behavior, indicative of low-dimensional
systems.